The Academy Awards, also known as the Oscars, is arguably the ultimate award for any movie. I plan to use the data I have collected to investigate the relationship between Oscars and money. Do Oscar winning movies cost more to make than movies that lose Oscars? Do both of these sets of movies cost more than movies that are not nominated? Do Oscar winners, Oscar loser, or non-nominated movies make more gross revenue? Which of these has the highest percentage profit? I will use the data to answer these questions and more. Along the way, I will also try to highlight some interesting statistics and facts. By analyzing this data, I hope to shed some light on whether production companies pay for Oscars, profit by making Oscars, both, or neither.
Also, by choosing a fun topic I hope that people looking find answers that they enjoy thinking about. It is certainly true that I had fun looking for my own answers.
Most of these 8 libraries are used extensively, while a couple have only one or two functions that came in handy for things I wanted to accomplish in the presentation.
library(tidyverse) # multiple tidy packages
library(readr) # for reading files
library(ggplot2) # for plotting
library(plotly) # for interactive plotting
library(dplyr) # data manipulation
library(flextable) # pretty tables
library(formattable) # comma function formatting numbers
library(gridExtra) # side by side plotsHere I read in the four data sets I wound up using. I give a brief explanation of each below.
budget <- read_csv('movies_budget.csv') # budget including new movies
oscars <- read_csv('the_oscar_award.csv') # oscar winners and losers
meta <- read_csv('movies_metadata.csv') # majority of data used
inflation <- read_csv('inflation_data.csv')
# inflation data for calculationsI used alexsychu kaggle to gather some more data up to 2020. This second dat set had an additional 7,668 rows of data. This user was trying to answer the question of how to predict if a movie will do well.
Lastly for movie data, I used data from unanimad kaggle . This data set had 10,395 rows of data and this kaggle user asked various questions about who won Oscars.
There were many renames and some work to choose a movie budget in some cases. Once I had the columns I wanted, I set in on joining the data together. First, I joined the meta data with the budget data to include all of the titles for which I had information. My next adventure (or misadventure) was to join this data with the inflation data. Once that was done, all that was left was joining this information with the Oscar data. All in all, I used three full joins.
oscars <- oscars %>% # rename year for join
rename('year' = 'year_film')
meta$year <- format(as.Date(meta$release_date, format = "%m/%d/%Y"), "%Y")
blue = '#000080' # just the color code I wanted to use for my print
meta <- meta %>%
mutate(year = as.numeric(year)) # easier use for comparisons/inflation
# renames
budget <- budget %>%
rename(vote_average = score) %>% # renaming vote data for joins etc
rename(vote_count = votes) %>%
rename(Title = 'Movie Title')
oscars <- oscars %>%
rename(Title = 'film') # rename for joins
adj <- inflation %>%
mutate(multiplier = (22.82/amount))
# create multiplier column for easy calculations
budget <- budget %>%
rename(new_budget = Budget) # rename for joins etc
options(scipen = 100) # avoid scientific notation
full_budget <- full_join(meta, budget, on = 'Title')
# full join of metadata and budget data to get new and old movies etc
full_bud <- full_budget %>%
mutate(budget = pmax(new_budget, meta_budget, na.rm = T)) %>%
select(Title,genres, budget,new_budget,
meta_budget, popularity,year,
release_date, revenue, runtime, vote_average,
vote_count,gross)
# set budget to max of two different budgets.
# picking max is arbitrary, but needed in most cases
full_bud <- full_bud[-c(1,2,3),] %>%
arrange(desc(as.numeric(budget)))
# remove first three unnecessary rows
full_bud <- full_bud %>% # replace 1900 sentinels with 2022
mutate(year = replace(year, year == 1900, 2022))
# Most of these seemed to be less known movies
# so I thought that 2022 would do the least harm
# with the inflation numbers
adj_bud <- full_join(full_bud, adj,
on = c('Title', 'year')) %>%
mutate(with_inflation = (as.numeric(budget) * (multiplier))) %>%
mutate(gross_inflation = (as.numeric(full_bud$gross)
* (multiplier))) %>%
select(Title,genres,vote_average, vote_count, budget,
with_inflation, gross_inflation, year, gross,
release_date) %>%
arrange(desc(with_inflation)) # joining movies with inflation
# creating with_inflation(budget) and gross_inflation columns
# arranging by highest with_inflation budgets
osc_bud <- full_join(adj_bud, oscars, on = c('Title'))
# joining all other data to Oscars data
# This is the starting point for most of my data manipulation
# 62706 entries, 15 total columns, of which 8 columns used.I wound up only using 8 of the 15 columns of my final data set. A few of these columns were created by manipulating other data. I had planned to look at more aspects, but I eventually realized that was too much for this presentation. Maybe someone else will see what I have done, and decide to dig deeper. Maybe I will come back to it someday myself.
oscar_titles <- oscars %>% # count number of titles in oscar data
distinct(Title)
oscar_wn_titles <- oscars %>% # count number of oscar winning movies
filter(winner == T) %>%
distinct(Title)
total_noms <- osc_bud %>% # count total number of nominations
filter(!is.na(winner))
total_wins <- osc_bud %>% # count total number of awards won
filter(winner == T)
bst_pic <- osc_bud %>%
group_by(Title) %>%
filter((str_detect(category, 'PIC') |
str_detect(category, 'OUT'))) %>%
select(Title, year,budget, with_inflation,
gross_inflation, gross, genres, vote_count,
vote_average, category, winner) %>%
distinct(Title, .keep_all = T)
# select best picture nominees in its many forms
bst_pic_won <- bst_pic %>% # count best picture winners
filter(winner == T) %>%
distinct(Title, .keep_all = T)
bst_act <- osc_bud %>% # count best actor/actress titles
group_by(Title) %>%
filter(str_detect(category, 'ACT')) %>%
select(Title, genres, budget, category, winner)
bst_act_wn <- bst_act %>%
filter(!is.na(winner)) %>%
filter(winner == T) %>%
distinct(Title, .keep_all = T) # count winners of actor/actressThe code above is used to find the numbers contained here. The data I was able to collect contains 4,834 movies nominated for an Academy Award since its inception in 1929. Of these 4,834 movies. 1,274 won at least one award. There have been 13,312 total Oscar nominations, and 3,001 total Oscar wins in the dataset. 559 movies have been nominated for Best Picture in its many forms. Out of these, 92 won. 1,154 movies have had an actor or actress nominated in either a leading or supporting role. 313 movies had at least one winner in an acting category.
The first graph is a scatter plot of budget vs gross. Because there were so many, I also wanted to make the scatter plot with non-nominated movies removed. The last plot shows the same information with inflation. The black lines are the break even line.
osc_bud <- osc_bud[!is.na(osc_bud$budget), ] # remove na budgets
osc_bud <- osc_bud %>%
mutate(budget = as.numeric(budget)) # make budget numeric
my_theme = theme(axis.text.x = element_text(angle = -90,
size = 5, color = blue ),
axis.text.y = element_text(color = blue)) +
theme(axis.title.x = element_text(color = blue),
axis.title.y = element_text(color = blue))
p <- osc_bud %>% select(Title, genres, gross, budget, winner) %>%
ggplot(aes(x = budget, y = gross, color = winner, label = Title), alpha = 0.1) +
geom_abline(intercept = 0, slope = 1) +
geom_point() +
my_theme # budget vs gross
ggplotly(p)p2 <- osc_bud %>% select(Title, genres, gross, budget, winner) %>%
filter(!is.na(winner)) %>%
ggplot(aes(x = budget, y = gross, color = winner, label = Title), alpha = 0.1) +
geom_abline(intercept = 0, slope = 1) +
geom_point() +
my_theme # non-nominated removed
ggplotly(p2)p3 <- osc_bud %>% select(Title, genres, gross, budget,with_inflation, gross_inflation, winner) %>%
filter(!is.na(winner)) %>%
ggplot(aes(x = with_inflation, y = gross_inflation, color = winner, label = Title), alpha = 0.1) +
geom_abline(intercept = 0, slope = 1) +
geom_point() +
my_theme
ggplotly(p3) # with inflationThe scatter plot above plots budget vs gross. Oscar winners are in blue, Oscar losers in red and non-nominated in grey. These colors will mean the same thing throughout this report.
high_prof <- osc_bud %>%
filter(budget != 0) %>%
mutate(pct_prof = 100 * (as.numeric(gross)/as.numeric(budget))) %>%
select(Title, budget, gross, pct_prof) %>%
arrange(desc(pct_prof))
flextable(head(high_prof, 25))Title | budget | gross | pct_prof |
Paranormal Activity | 15,000 | 193,355,800 | 1,289,038.667 |
The Blair Witch Project | 60,000 | 248,639,099 | 414,398.498 |
The Gallows | 100,000 | 42,964,410 | 42,964.410 |
El Mariachi | 7,000 | 2,040,920 | 29,156.000 |
Once | 150,000 | 20,936,722 | 13,957.815 |
Clerks | 27,000 | 3,151,130 | 11,670.852 |
Napoleon Dynamite | 400,000 | 46,138,887 | 11,534.722 |
In the Company of Men | 25,000 | 2,804,473 | 11,217.892 |
Keeping Mum | 169,000 | 18,586,834 | 10,998.127 |
Open Water | 500,000 | 54,683,487 | 10,936.697 |
The Devil Inside | 1,000,000 | 101,758,490 | 10,175.849 |
The Quiet Ones | 200,000 | 17,835,162 | 8,917.581 |
Saw | 1,200,000 | 103,911,669 | 8,659.306 |
Searching | 880,000 | 75,462,037 | 8,575.231 |
Primer | 7,000 | 545,436 | 7,791.943 |
E.T. the Extra-Terrestrial | 10,500,000 | 792,910,554 | 7,551.529 |
My Big Fat Greek Wedding | 5,000,000 | 368,744,044 | 7,374.881 |
The Full Monty | 3,500,000 | 257,938,649 | 7,369.676 |
The Full Monty | 3,500,000 | 257,938,649 | 7,369.676 |
The Full Monty | 3,500,000 | 257,938,649 | 7,369.676 |
The Full Monty | 3,500,000 | 257,938,649 | 7,369.676 |
Friday the 13th | 550,000 | 39,754,601 | 7,228.109 |
Fireproof | 500,000 | 33,473,297 | 6,694.659 |
Insidious | 1,500,000 | 99,557,032 | 6,637.135 |
Unfriended | 1,000,000 | 62,882,090 | 6,288.209 |
low_prof <- high_prof %>%
na.omit(budget) %>%
arrange(pct_prof) %>%
distinct(Title, .keep_all = T)
flextable(head(low_prof, 25))Title | budget | gross | pct_prof |
Trojan War | 15,000,000 | 309 | 0.002060000 |
Madadayo | 11,900,000 | 596 | 0.005008403 |
Ginger Snaps | 5,000,000 | 2,554 | 0.051080000 |
Philadelphia Experiment II | 5,000,000 | 2,970 | 0.059400000 |
The Lovers on the Bridge | 28,000,000 | 29,679 | 0.105996429 |
Savior | 10,000,000 | 14,328 | 0.143280000 |
Tanner Hall | 3,000,000 | 5,073 | 0.169100000 |
Crimewave | 3,000,000 | 5,101 | 0.170033333 |
Deadfall | 10,000,000 | 18,369 | 0.183690000 |
Hell's Kitchen | 6,000,000 | 11,710 | 0.195166667 |
Barefoot | 6,000,000 | 15,071 | 0.251183333 |
Freaked | 11,000,000 | 29,296 | 0.266327273 |
Parasite | 800,000 | 2,270 | 0.283750000 |
Passion Play | 8,000,000 | 25,603 | 0.320037500 |
About Cherry | 2,500,000 | 8,315 | 0.332600000 |
Rock & Rule | 8,000,000 | 30,379 | 0.379737500 |
Best Laid Plans | 7,000,000 | 27,816 | 0.397371429 |
Brenda Starr | 16,000,000 | 67,878 | 0.424237500 |
O.C. and Stiggs | 7,000,000 | 29,815 | 0.425928571 |
My Summer Story | 15,000,000 | 70,936 | 0.472906667 |
The Boondock Saints | 6,000,000 | 30,471 | 0.507850000 |
Vamps | 16,000,000 | 92,748 | 0.579675000 |
Love Ranch | 25,000,000 | 146,149 | 0.584596000 |
Arizona Dream | 19,000,000 | 112,547 | 0.592352632 |
The Irishman | 159,000,000 | 968,853 | 0.609341509 |
osc_bud <- osc_bud %>%
mutate(budget = as.numeric(budget)) %>%
mutate(gross = as.numeric(gross))
# may be at least in part redundant but to be sure numeric
avg_prof <- osc_bud %>%
filter(!is.na(budget)) %>%
filter(!is.na(gross)) %>%
mutate(avg_profit = (100*sum(gross))/(sum(budget))) %>%
mutate(percent = 100 * (gross/budget))
avg_profit <- head(avg_prof$avg_profit, 1)
win_prof <- osc_bud %>%
filter(winner == T) %>% # calculate percent profit Oscar winners
filter(!is.na(budget)) %>%
filter(!is.na(gross)) %>%
mutate(winner_profit = (100*sum(gross))/(sum((budget)))) %>%
mutate(percent = 100 * (gross/budget))
lsr_prof <- osc_bud %>%
filter(winner == F) %>% # calculate percent profit Oscar losers
filter(!is.na(budget)) %>%
filter(!is.na(gross)) %>%
mutate(loser_profit = 100*sum(gross/sum(budget))) %>%
mutate(percent = 100 *(gross/budget))
no_nom_prof <- osc_bud %>%
filter(is.na(winner)) %>% # calculate percent profit non-nominated
filter(!is.na(budget)) %>%
filter(!is.na(gross)) %>%
mutate(nonnom_profit = (100*sum(gross))/(sum(budget))) %>%
mutate(percent = 100 * (gross/budget))
money_plot <- osc_bud %>% select(Title, genres, gross,
budget,with_inflation,
gross_inflation, winner) %>%
ggplot() +
geom_density(aes((budget), fill = winner), alpha = 0.4) +
my_theme +
xlim(0, 60000000) +
ylim(0, 0.00000035) # budget density plot
ggplotly(money_plot)money_plot2 <- osc_bud %>% select(Title, genres,
gross, with_inflation,
budget, winner,
gross_inflation) %>%
ggplot() +
geom_density(aes(with_inflation, fill = winner), alpha = 0.4)+
xlim(0, 60000000) +
ylim(0, 0.00000035) +
my_theme # budget with inflation density plot
ggplotly(money_plot2)grid.arrange(money_plot, money_plot2, ncol=2) # side by side budget plotsprof_plot <- osc_bud %>% select(Title, genres, gross,
budget,with_inflation,
gross_inflation, winner) %>%
ggplot() +
geom_density(aes((gross), fill = winner), alpha = 0.4) +
xlim(0, 200000000) +
ylim(0, 0.000000070) +
my_theme # plot gross revenue
ggplotly(prof_plot)prof_plot2 <- osc_bud %>% select(Title, genres, gross,
budget,gross_inflation, winner) %>%
ggplot() +
geom_density(aes(gross_inflation, fill = winner), alpha = 0.4) + xlim(0, 200000000) +
ylim(0, 0.000000070) +
my_theme
ggplotly(prof_plot2) # plot gross inflationgrid.arrange(prof_plot, prof_plot2, ncol=2) # side by side gross plotsgrid.arrange(money_plot, prof_plot, ncol=2) # spent vs made chartsgrid.arrange(money_plot2, prof_plot2, ncol=2)# same chart with inflationbst_pic_budgets <- bst_pic_won %>%
filter(budget > 0) %>%
mutate(year_title = paste(year, ', ', Title))
# eliminate the few best (less than 10 I think) pics with 0 or na budget
r <- bst_pic_budgets %>%
ggplot() +
geom_col(aes(x = year_title, y = with_inflation), fill= blue) +
my_theme # best picture column plot
ggplotly(r) min_bud <- bst_pic_won %>%
filter(with_inflation > 0) %>%
filter(with_inflation == min(with_inflation)) %>%
select(Title, with_inflation) %>%
arrange(with_inflation) # determine minimum of best pic budgets
min_bud_title <- head(min_bud$Title, 1) # title of min cost best pic
min_bud_cost <- head(min_bud$with_inflation, 1) # min cost best pic
avg_bst <- mean(min_bud$with_inflation) # average cost of best pic
max_bud <- bst_pic_won %>%
filter(with_inflation == max(with_inflation)) %>%
select(Title, with_inflation) %>%
arrange(desc(with_inflation)) # determine most expensive best pic
max_bud_title <- head(max_bud$Title, 1) # Title of most expensive
max_bud_cost <- head(max_bud$with_inflation, 1) # budget of most expensiveThe minimum budget for a Best Picture winner with inflation was Marty with a cost of $3,674,769.95. The average budget for a Best Picture winner with inflation was $51,335,637.82. The maximum budget for a Best Picture winner with inflation was Titanic with a cost of $358,241,758.24.
I wanted to look at the relationship between the Oscars and money. Looking at the data, it seems clear that Oscar winning movies both tend to cost more to make and have more profits than Oscar losing movies. Following that trend, Oscar losing movies tent to cost more and have more profits than movies that aren’t nominated at all. These results were not surprising, as I would hope that award winning movies would tend to be “better” than movies that win no awards. Still, it hard to say which comes first. Maybe with further investigation into how the money for these movies was spent, might shed more light on an answer. Lastly, it is clear from the Best Picture column chart that a high budget is not a requirement for being an Oscar winner. Lastly I’d just like to add that the movie business seems to be very lucrative. The average percent profit for these movies was 365.32. In essence, this means that a $50,000,000 movie will gross on average 182,659,561.70.